DETAILED ACTION 

1. Claims 58-62 are pending in the instant application. Claim 63 has been canceled and 
claim 58 has been amended as requested by Applicant in the amendment filed September 14, 
2004. 

Withdrawn Objections and Rejections 

2. Any objection or rejection of record which is not expressly repeated in this action has 
been overcome by Applicant's response and withdrawn. 

Maintained Rejections 
Claim Rejections - 35 USC §101 and § 112 
35 U.S.C. 101 reads as follows: 

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or 
any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and 
requirements of this title. 

3. Claims 58-62 remain rejected under 35 U.S.C. 101 because the claimed invention is not 
supported by either a specific and substantial asserted utility or a well established utility, for 
reasons of record in the previous office action, mailed May 20, 2004, at pages 4-8 and below. 

Applicants' arguments (pages 10-19, Paper filed Sept. 14, 2004) have been fully 
considered but are not deemed persuasive. 

Applicants traverse the rejection and discuss the legal standard for utility on pages 10-11, 
and starting on page 12 discuss the proper application of the legal standard. Applicants rely on 
the gene amplification data for patentable utility for the PRO274 protein and antibodies thereto, 
and explain the gene amplification assay of Example 114, in which PRO274 is amplified more 
than two-fold in three types of human primary lung tumors. Applicants assert that this 
amplification is significant and that the PRO274 gene has utility as a diagnostic of lung cancer. 
Applicants provide the Declaration by Dr. Audrey Goddard, in which she states that a gene 
identified as being amplified at least 2-fold by the quantitative TaqMan PCR assay in a tumor 
sample relative to a normal sample is useful as a marker for the diagnosis of cancer. Applicants 
assert that, because the TaqMan real-time PCR method has gained wide recognition for its 
versatility, sensitivity and accuracy, and is in extensive use for the study of gene amplification, 
one of ordinary skill in the art would find it credible that PRO274 is a diagnostic marker of 
human lung cancer. 

The Goddard Declaration under 37 CFR 1.132, filed Sept. 14, 2004, is insufficient to 
overcome the rejection of claims 58-62 as set forth in the last Office action because, while the 
declaration and supporting references are convincing that the TaqMan real-time PCR method is 
very sensitive and can identify amplified genes, the claims are drawn to antibodies to the protein 
encoded by the PRO274 gene, and as discussed in the previous Office action and below, it is not 
predictable that gene amplification results in increased mRNA expression, or that increased 
mRNA expression results in increased protein production. 

Applicant argues that the Gygi et al. publication does not support the rejection. Applicant 
characterizes Gygi et al. as teaching that there is a general trend but no strong correlation 
between polypeptide expression level and transcript level. Applicant further characterizes Gygi 
et al's conclusions as showing that there is a positive correlation between transcript and 
polypeptide for most of the 150 yeast polypeptides studied, but the correlation is not linear and 
thus one cannot accurately predict polypeptide levels from mRNA levels. Applicant concludes 
that Gygi et al. show that it is more likely than not that a positive correlation exists between 
mRNA and polypeptide levels. This has been fully considered but is not found to be persuasive. 
In the instant case, the specification provides data showing a very small increase in DNA copy 
number, approximately 2-fold, in a few tumor samples for PRO274. There is no evidence 
regarding whether or not the PRO274 mRNA or polypeptide levels are also increased in these 
tumor samples. Since the instant claims are directed to antibodies to PRO274 polypeptide, it 
was imperative to find evidence in the relevant scientific literature whether or not a small 
increase in DNA copy number would be considered by the skilled artisan to be predictive of 
increased mRNA and polypeptide levels. Pennica et al. was cited as evidence showing a lack 
of correlation between gene (DNA) amplification and elevated mRNA levels. Gygi et al. was 
cited as providing evidence that polypeptide levels cannot be accurately predicted from mRNA 
levels, and that variances as much as 40-fold or even 50-fold were not uncommon. Given the 
small magnitude by which the DNA copy number of PRO274 is increased, and the evidence 
provided by Gygi et al. and Pennica et al., it is clear that one skilled in the art would not assume 
that a small increase in gene copy number would correlate with significantly increased mRNA or 
polypeptide levels. One skilled in the art would do further research to determine whether or not 
the PRO274 polypeptide levels increased significantly in the tumor samples. The requirement 
for such further research makes it clear that the asserted utility is not yet in 
currently available form, i.e., it is not substantial. This further experimentation is part of the act 
of invention and until it has been undertaken, Applicant's claimed invention is incomplete. The 
instant situation is directly analogous to that which was addressed in Brenner v. Manson, 148 
U.S.P.Q. 689 (Sup. Ct. 1966), in which the court held that: 

"The basic quid pro quo contemplated by the Constitution and the Congress for 
granting a patent monopoly is the benefit derived by the public from an invention 
with substantial utility", "[u]nless and until a process is refined and developed to 
this point-where specific benefit exists in currently available form-there is 
insufficient justification for permitting an applicant to engross what may prove to 
be a broad field", and "a patent is not a hunting license", "[i]t is not a reward for 
the search, but compensation for its successful conclusion." 

Applicant refers to three additional articles (Orntoft et al., Hyman et al., and Pollack et al.) 
as providing evidence that gene amplification generally results in elevated levels of the encoded 
polypeptide. Applicant characterizes Orntoft et al. as teaching in general (18 of 23 cases) 
chromosomal areas with more than 2-fold gain of DNA showed a corresponding increase in 
mRNA transcripts. Applicant characterizes Hyman et al. as providing evidence of a prominent 
global influence of copy number changes on gene expression levels. Applicant characterizes 
Pollack et al. as teaching that 62% of highly amplified genes show moderately or highly elevated 
expression and that, on average, a 2-fold change in DNA copy number is associated with a 1.5- 
fold change in mRNA levels. This has been fully considered but is not found to be persuasive. 
Orntoft et al. appear to have looked at increased DNA content over large regions of 
chromosomes and compared that to mRNA and polypeptide levels from the chromosomal 
region. Their approach to investigating gene copy number was termed CGH. Orntoft et al. do 
not appear to look at gene amplification, mRNA levels and polypeptide levels from a single gene 
at a time. The instant specification reports data regarding amplification of individual genes, 
which are not likely to be in a chromosomal region which is highly amplified, given the low 
ΔCt values. Orntoft et al. concentrated on regions of chromosomes with strong gains of 
chromosomal material containing clusters of genes (p. 40). This analysis was not done for 
PRO274 in the instant specification. That is, it is not clear whether or not PRO274 is in a gene 
cluster in a region of a chromosome that is highly amplified. Therefore, the relevance of Orntoft 
et al. is not clear. Hyman et al. used the same CGH approach in their research. Less than half 
(44%) of highly amplified genes showed mRNA overexpression (abstract). Polypeptide levels 
were not investigated. Therefore, Hyman et al. also do not support utility of the claimed 
antibodies to the polypeptides. Pollack et al. also used CGH technology, concentrating on large 
chromosome regions showing high amplification (p. 12965). Pollack et al. did not investigate 
polypeptide levels. Therefore, Pollack et al. also do not support the asserted utility of the 
claimed invention. Importantly, none of the three papers reported that the research was relevant 
to identifying probes that can be used as cancer diagnostics. The three papers state that the 
research was relevant to the development of potential cancer therapeutics, but also clearly imply 
that much further research was needed before such therapeutics were in readily available form. 
Accordingly, the specification's assertions that the claimed PRO274 antibodies have utility in the 
fields of cancer diagnostics and cancer therapeutics are not substantial. 

The Polakis declaration under 37 CFR 1.132 filed Sept. 14, 2004 is insufficient to 
overcome the rejection of claims 58-62 based upon 35 U.S.C. §§ 101 and 112, first paragraph, as 
set forth in the last Office action for the following reasons: 

Applicant presents a declaration by Dr. Polakis filed with the response under 37 CFR 
1.132. In the declaration, Dr. Polakis states that the primary focus of the Tumor Antigen Project 
was to identify tumor cell markers useful as targets for cancer diagnostics and therapeutics. Dr. 
Polakis states that approximately 200 gene transcripts were identified that are present in human 
tumor cells at significantly higher levels than in corresponding normal human cells. Dr. Polakis 
states that antibodies to approximately 30 of the tumor antigen polypeptides have been 
developed and used to show that approximately 80% of the samples show correlation between 
increased mRNA levels and changes in polypeptide levels. Dr. Polakis states that it remains a 
central dogma in molecular biology that increased mRNA levels are predictive of corresponding 
increased levels of the encoded polypeptide. Dr. Polakis characterizes the reports of instances 
where such a correlation does not exist as exceptions to the rule. This has been fully considered 
but is not found to be persuasive. First, it is important to note that the instant specification 
provides no information regarding increased mRNA levels of PRO274 in tumor samples relative 
to normal samples. Only gene amplification data were presented. Therefore, the declaration is 
insufficient to overcome the rejection of claims 58-62 based upon 35 U.S.C. §§ 101 and 112, 
first paragraph, since it is limited to a discussion of data regarding the correlation of mRNA 
levels and polypeptide levels, and not gene amplification levels and polypeptide levels. 
Furthermore, the declaration does not provide data such that the examiner can independently 
draw conclusions. Only Dr. Polakis' conclusions are provided in the declaration. There is no 
evidentiary support for Dr. Polakis' statement that it remains a central dogma in molecular 
biology that increased mRNA levels are predictive of corresponding increased levels of the 
encoded polypeptide. Finally, it is noted that the literature cautions researchers from drawing 
conclusions based on small changes in transcript expression levels between normal and 
cancerous tissue. For example, Hu et al. (2003, Journal of Proteome Research 2:405-412) 
analyzed 2286 genes that showed a greater than 1-fold difference in mean expression level 
between breast cancer samples and normal samples in a microarray (p. 408, middle of right 
column). Hu et al. discovered that, for genes displaying a 5-fold change or less in tumors 
compared to normal, there was no evidence of a correlation between altered gene expression and 
a known role in the disease. However, among genes with a 10-fold or more change in expression 
level, there was a strong and significant correlation between expression level and a published 
role in the disease (see discussion section). PRO274 does not display a 10-fold or greater 
amplification, according to the specification. 

Applicants further assert that even if one assumes that it is more likely than not that there 
is no correlation between gene amplification and increased mRNA/protein expression, a 
polypeptide encoded by a gene that is amplified in cancer would still have a specific and 
substantial utility, and provide the declaration by Dr. Avi Ashkenazi. Dr. Ashkenazi explains 
that even when amplification of a cancer marker gene does not result in significant 
over-expression of the corresponding gene product, this very absence of gene product 
over-expression still provides significant information for cancer diagnosis and treatment, in 
that if the gene product is over-expressed in some tumor types but not others, this would enable 
more accurate tumor classification and hence better determination of suitable therapy, and 
additionally, if a gene is amplified but the corresponding gene product is not over-expressed, 
the clinician accordingly will decide not to treat a patient with agents that target that gene 
product. 

The Ashkenazi declaration under 37 CFR 1.132, filed Sept. 14, 2004, is insufficient to overcome 
the rejection of claims 58-62 based upon lack of utility as set forth in the last Office action 
because it has not been demonstrated that the protein of the instant invention is differentially 
expressed in different tumors. If it were, the protein would have a specific and substantial utility 
for tumor classification, but the mere assertion that it may be differentially expressed does not 
provide a specific and substantial utility, and is an invitation to experiment. The argument that, if 
a gene is amplified but the gene product is not over-expressed, the clinician would accordingly 
decide not to treat a patient with agents that target the gene product is also insufficient to 
overcome the rejection of the claims. If a specific gene product was known to be involved in 
cancer and if there were known compounds that could be used to target the gene product, this 
would be an acceptable utility. However, the gene product of the instant invention has not been 
demonstrated to be involved in cancer. Over-expression of a gene product in a cancer cell does 
not necessarily mean that the gene product is involved in the cancer and that targeting the gene 
product would be therapeutic. Additionally, there are no known compounds that would target 
the gene product. 

Applicants provide the Hanna et al. reference to support the Declaration of Dr. 
Ashkenazi. The Hanna reference is not applicable to the instant fact situation, as it deals with a 
known tumor associated gene, and not with a prospective analysis of the type found in this 
specification. 

The proposed uses of the claimed invention are simply starting points for further 
research and investigation into potential practical uses of the claimed polypeptides. For all of 
these reasons, the rejections are maintained. 

The following is a quotation of the first paragraph of 35 U.S.C. 112: 

The specification shall contain a written description of the invention, and of the manner and process of making 
and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it 
pertains, or with which it is most nearly connected, to make and use the same and shall set forth the best mode 
contemplated by the inventor of carrying out his invention. 

4. Claims 58-62 also remain rejected under 35 U.S.C. 112, first paragraph. Specifically, since 
the claimed invention is not supported by either a specific and substantial asserted utility or a 
well established utility for the reasons set forth above, one skilled in the art clearly would not 
know how to use the claimed invention. 

Rejections over Prior Art 
Claim Rejections - 35 USC §102 and § 103 

The text of those sections of Title 35, U.S. Code not included in this action can be found 
in a prior Office action. 

5.1 Claims 58-62 remain rejected under 35 U.S.C. 102(b) as being anticipated by Ho et al., 
Science, Vol. 289, July 14, 2000, pages 265-270, for reasons of record in the previous Office 
Action, mailed May 20, 2004, at page 10, and below. 

5.2 Claims 59-62 remain rejected under 35 U.S.C. 103(a) as being unpatentable over Ho et 
al., Science, Vol. 289, July 14, 2000, pages 265-270, in view of Immunobiology, The Immune 
System in Health and Disease, Third Edition, Janeway and Travers, Ed., 1997, for reasons of 
record in the previous Office Action, mailed May 20, 2004, at page 10, and below. 

Applicants traverse rejections and assert that they rely on the gene amplification assay for 
patentable utility which was first disclosed in International Application no. PCT/US00/03565, 
filed Feb. 11, 2000, and assert that they are entitled to at least that filing date, so that Ho et al. is 
not prior art. Applicants' arguments have been fully considered but are not deemed persuasive, 
because the gene amplification assay fails to provide a patentable utility for the antibodies to the 
protein, for reasons discussed above. 

It is believed that all pertinent arguments have been answered. 



6. No claim is allowed. 



Conclusion 

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time 
policy as set forth in 37 CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within TWO 
MONTHS of the mailing date of this final action and the advisory action is not mailed until after 
the end of the THREE-MONTH shortened statutory period, then the shortened statutory period 
will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 
CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, 
however, will the statutory period for reply expire later than SIX MONTHS from the mailing 
date of this final action. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Eileen B. O'Hara, whose telephone number is (571) 272-0878. 
The examiner can normally be reached on Monday through Friday from 10:00 AM to 6:30 PM. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Brenda Brumback can be reached at (571) 272-0961. 

The fax phone number for the organization where this application or proceeding is 
assigned is 703-872-9306. 

Any inquiry of a general nature or relating to the status of this application should be 
directed to the Group receptionist whose telephone number is (571) 272-1600. 

Information regarding the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Status information for published applications 
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://portal.uspto.gov/external/portal/pair. Should you have questions on access to 
the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll 
free). 

Eileen B. O'Hara, Ph.D. 
Patent Examiner 






Notice of References Cited 

Application/Control No.: 09/978,192 
Applicant(s)/Patent Under Reexamination: ASHKENAZI ET AL. 
Examiner: Eileen O'Hara 
Art Unit: 1646 
Page 1 of 1 

U.S. PATENT DOCUMENTS 
Columns: Document Number (Country Code-Number-Kind Code); Date (MM-YYYY); Name; Classification 
A through M: (no entries) 

FOREIGN PATENT DOCUMENTS 
Columns: Document Number (Country Code-Number-Kind Code); Date (MM-YYYY); Country; Name; Classification 
N through T: (no entries) 

NON-PATENT DOCUMENTS 
(Include as applicable: Author, Title, Date, Publisher, Edition or Volume, Pertinent Pages) 
U: Hu et al. (2003), Journal of Proteome Research 2:405-412. 
V through X: (no entries) 

* A copy of this reference is not being furnished with this Office action. (See MPEP § 707.05(a).) 
Dates in MM-YYYY format are publication dates. Classifications may be US or foreign. 

U.S. Patent and Trademark Office 
PTO-892 (Rev. 01-2001) Notice of References Cited Part of Paper No. 11192004 



Journal of Proteome Research 
research articles 

Analysis of Genomic and Proteomic Data Using Advanced Literature Mining 

High-throughput technologies, such as proteomic screening and DNA micro-arrays, produce vast 
amounts of data requiring comprehensive analytical methods to decipher the biologically relevant 
results. One approach would be to manually search the biomedical literature; however, this would be 
an arduous task. We developed an automated literature-mining tool, termed MedGene, which 
comprehensively summarizes and estimates the relative strengths of all human gene-disease 
relationships in Medline. Using MedGene, we analyzed a novel micro-array expression dataset 
comparing breast cancer and normal breast tissue in the context of existing knowledge. We found no 
correlation between the strength of the literature association and the magnitude of the difference in 
expression level when considering changes as high as 5-fold; however, a significant correlation was 
observed (r = 0.41; p = 0.05) among genes showing an expression difference of 10-fold or more. 
Interestingly, this only held true for estrogen receptor (ER) positive tumors, not ER negative. MedGene 
identified a set of relatively understudied, yet highly expressed genes in ER negative tumors worthy of 
further examination. 

Keywords: bioinformatics • micro-array • text mining • gene-disease association • breast cancer 



Introduction 

At its current pace, the accumulation of biomedical literature 
outpaces the ability of most researchers and clinicians to stay 
abreast of their own immediate fields, let alone cover a broader 
range of topics. For example, to follow a single disease, e.g., 
breast cancer, a researcher would have had to scan 130 different 
journals and read 27 papers per day in 1999. 1 This problem is 
accentuated with high-throughput technologies such as DNA 
micro-arrays and proteomics, which require the analysis of 
large datasets involving thousands of genes, many of which are 
unfamiliar to a particular researcher. In any microarray experiment, 
thousands of genes may demonstrate statistically significant 
expression changes, but only a fraction of these may 
be relevant to the study. The ability to interpret these datasets 
would be enhanced if they could be compared to a comprehensive 
summary of what is known about all genes. Thus, there 
is a need to summarize existing knowledge in a format that 
allows for the rapid analysis of associations between genes and 
diseases or other specific biological concepts. 

One solution to this problem is to compile structured digital 
resources, such as the Breast Cancer Gene Database 1 and the 
Tumor Gene Database. 2 However, as these resources are hand-curated, 
the labor-intensive review process becomes a rate-limiting 
step in the growth of the database. As a result, these 
databases have a limited scale and the genes are not selected 
in a systematic fashion. 

An alternative approach is automated text mining, a method 
which involves automated information extraction by searching 
documents for text strings and analyzing their frequency and 
context. This approach has been used successfully in several 
instances for biological applications. In most cases, it has been 
applied to extract information about the relationships or 
interactions that proteins or genes have with one another, in 
the literature or by functional annotation. 3-7 Thus far, few 
publications have applied text-mining to examine the global 
relationships between genes and diseases. Perez-Iratxeta et al. 
automatically examined the GO (Gene Ontology) annotation 
of genes and their predicted chromosomal locations in order 
to identify genes linked to inherited disorders. 8 

To obtain a more global understanding of disease development, 
it would be valuable to incorporate information regarding 
all possible gene-disease relationships, including biochemical, 
physiological, pharmacological, epidemiological, as well as 
genetic. This information would enable comprehensive comparisons 
between large experimental datasets and existing 
knowledge in the literature. This would accomplish two things. 
First, it would serve to validate experiments by demonstrating 
that known responses occur as predicted. Second, it would 
rapidly highlight which genes are corroborated by the literature 
and which genes are novel in a given context. We have utilized 
a computational approach to literature mining to produce a 
comprehensive set of gene-disease relationships. In addition, 
we have developed a novel approach to assess the strength of 
each association based on the frequency of citation and co-citation. 
We applied this tool to help interpret the data from a 
large micro-array gene expression experiment comparing 
normal and cancerous breast tissue. 

Journal of Proteome Research 2003, 2, 405-412 
Published on Web 06/13/2003 

Methods 

MedGene Database. MedGene is a relational database, storing 
disease and gene information from NCBI, text mining results, 
statistical scores, and hyperlinks to the primary literature. 
MedGene has a web-based user interface for users to 
query the database (http://hipseq.med.harvard.edu/MedGene/). 

Text Mining Algorithms. MeSH files were downloaded from 
the MeSH web site at NLM (National Library of Medicine) 
(http://www.nlm.nih.gov/mesh/meshhome.html) and human disease 
categories were selected. LocusLink files were downloaded from 
the LocusLink web site at NCBI (http://www.ncbi.nih.gov/LocusLink/). 
Official/preferred gene symbol, official/preferred 
gene name, and gene alternative symbols and names, together with all 
relevant annotations and URLs for each LocusLink record, were 
collected. Gene search terms were used for literature searching 
and included all qualified gene names, gene symbols, and gene 
family terms. Primary gene keys, predominantly qualified gene 
family terms and gene official/preferred symbols, were used 
to index Medline records. If the official/preferred gene symbols 
did not meet the standards to be an index, then qualified gene 
official/preferred names were used. A local copy of Medline 
records (up to July, 2002) was pre-selected. 

A JAVA module examined the MeSH terms and then indexed 
each Medline record with the appropriate disease terms. A 
separate JAVA module was used to examine the titles and 
abstracts for gene search terms and then to index the gene- 
related Medline records with the relevant primary gene key(s). 

Statistical Methods. For every gene and disease pair, we 
counted records that were indexed for both gene and disease 
(double positive hits), for disease only (disease single hits), for 
gene only (gene single hits), and for neither gene nor disease 
(double negative hits) to generate a 2 x 2 contingency table. 
On the basis of the contingency table framework, we applied 
different statistical methods to estimate the strength of gene-disease 
relationships and evaluated the results. These methods 
included chi-square analysis, Fisher's exact probabilities, relative 
risk of gene, and relative risk of disease 16 
(http://hipseq.med.harvard.edu/MedGene/). In addition, we computed 
the "product of frequency", which is the product of the 
proportion of disease/gene double hits to disease single hits 
and the proportion of disease/gene double hits to gene single 
hits. To obtain a normal distribution, we transformed all the 
statistical scores using the natural logarithm. We selected the 
log of the product of frequency (LPF) to validate MedGene and 
to use for the analysis with the micro-array data. Spearman 
rank correlation coefficients were used to assess the linear 
relationship between LPF and micro-array fold change in 
expression level. 
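
The counting and scoring steps described above can be sketched as follows. This is only an illustrative sketch, not the MedGene implementation: the function names are hypothetical, records are assumed to be identified by integer IDs, and "single hits" is read literally as records indexed for one term only, as the text defines them. The Spearman coefficient is computed here as the Pearson correlation of average ranks.

```python
import math

def contingency_counts(gene_records, disease_records, total_records):
    """Return the four cells of the 2 x 2 table for one gene-disease pair.

    gene_records and disease_records are sets of record IDs indexed with
    the gene or the disease; total_records is the size of the corpus.
    """
    double_hits = len(gene_records & disease_records)
    gene_only = len(gene_records - disease_records)
    disease_only = len(disease_records - gene_records)
    double_negative = total_records - double_hits - gene_only - disease_only
    return double_hits, disease_only, gene_only, double_negative

def log_product_of_frequency(double_hits, disease_only, gene_only):
    """Natural log of (double / disease single hits) * (double / gene single hits).

    Assumption: single hits are the "only" cells of the table. Returns None
    when the score is undefined (a zero count in any of the three cells).
    """
    if double_hits == 0 or disease_only == 0 or gene_only == 0:
        return None
    product = (double_hits / disease_only) * (double_hits / gene_only)
    return math.log(product)

def _ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman coefficient as the Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return None  # undefined when one variable is constant
    return cov / (sx * sy)
```

With such helpers, each gene-disease pair yields one LPF score, and `spearman` can then be applied to paired lists of LPF scores and fold changes.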

Global Analysis. Diseases with at least 50 related genes were 
selected for clustering analysis, and the LPF scores were 
normalized with total score for each disease. Hierarchical 
clustering was done with the "Cluster" software and the 
clustering result was visualized using "TreeViewer" 
(http://rana.lbl.gov/EisenSoftware.htm). 

Breast Tissue Micro-Arrays. Eighty-nine breast cancer 
samples (79% ER-positive) and 7 normal breast tissue samples 
were selected from the Harvard Breast SPORE frozen tissue 
repository and were representative of the spectrum of histological 
types, grades, and hormone receptor immuno-phenotypes 
of breast cancer. Biotinylated cRNA, generated from the 
total RNA extracted from the bulk tumor, was hybridized to 
Affymetrix U95A oligo-nucleotide micro-arrays. These micro-arrays 
consist of 12 400 probes, which represent approximately 
9000 genes. Raw expression values were obtained using GENECHIP 
software from Affymetrix, and then further analyzed using 
the DNA-Chip Analyzer (dChip) custom software. 

Results 

Automated Indexing of Medline Records by Disease and 
Gene. To study the gene-disease associations in the literature, 
we first compiled complete lists for human diseases and human 
genes. To index all Medline records that were relevant to 
human diseases, the Medical Subject Heading (MeSH) index 
of Medline records was utilized. MeSH is a controlled medical 
vocabulary from the National Library of Medicine and consists 
of a set of terms or subject headings that are arranged in both 
an alphabetic and a hierarchical structure. Medline records 
are reviewed manually and MeSH terms are added to each with 
software assistance. 9,10 Twenty-three human disease category 
headings along with all of their child terms (see the Supporting 
Information, Supplemental Table 1, or visit 
http://hipseq.med.harvard.edu/MedGene/publication/s_Table1.html) were 
selected from the 2002 MeSH index, creating a list of 4033 
human diseases. 

No index comparable to the MeSH index exists for genes, 
and thus, it was necessary to apply a string search algorithm 
for gene names or symbols found in Medline text. A complete 
list of genes, gene names, gene symbols, and frequently used 
synonyms was collected from the LocusLink database at 
NCBI, 11,12 which contains 53 259 independent records keyed 
by an official gene symbol or name (June 18th, 2002). For the 
purposes of this study, no distinction was made between genes 
and their gene products. Authors often use the same name for 
both, differentiating the two only by the use of italics, if at all. 
For the intended use of this study, this lack of distinction is 
unlikely to have a large effect and may in fact be beneficial. 

Initial attempts to search the literature using these lists 
revealed several sources of false positives and false negatives 
(Table 1). False positives primarily arose when the searched
term had other meanings, whereas false negatives arose from
syntax discrepancies, necessitating the development of filters
to reduce these errors. The syntax issues were readily handled
by including alternate syntax forms in the search terms. The 
false positive cases, caused by duplicative and unrelated 
meanings for the terms, were more difficult to manage. Where 
possible, case sensitive string mapping reduced inappropriate 
citations. In many cases, however, this was not sufficient and 
the terms had to be eliminated entirely, thereby reducing the 
false positive rate but unavoidably under-representing some 
genes. 
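
The filters described above can be illustrated with a minimal sketch. The term lists, function names, and the dash rule below are assumptions chosen to mirror the examples in Table 1; they are not the authors' actual filter lists or code.

```python
import re

# Hypothetical filter settings for illustration only.
ELIMINATED_TERMS = {"MAG", "PA"}   # ambiguous symbols dropped entirely
CASE_SENSITIVE_TERMS = {"WAS"}     # symbols that are also English words


def syntax_variants(term):
    """Return the term plus a dashed form (e.g. BAG1 -> BAG-1)."""
    variants = {term}
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", term)
    if match:
        variants.add(f"{match.group(1)}-{match.group(2)}")
    return variants


def term_in_text(term, text):
    """Whole-word search that applies the elimination and case filters."""
    if term in ELIMINATED_TERMS:
        return False
    # Case-insensitive by default; case-sensitive for flagged symbols.
    flags = 0 if term in CASE_SENSITIVE_TERMS else re.IGNORECASE
    for variant in syntax_variants(term):
        pattern = (r"(?<![A-Za-z0-9])" + re.escape(variant)
                   + r"(?![A-Za-z0-9])")
        if re.search(pattern, text, flags):
            return True
    return False
```

Under these assumptions, `term_in_text("BAG1", "Expression of BAG-1 in tumors")` finds the dashed form, while `term_in_text("WAS", "The sample was frozen")` does not match the lowercase English word. Eliminating a term removes its false positives at the cost of under-representing that gene, as noted above.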

For the purposes of data tracking, a primary gene key was
selected to represent all synonyms that correspond to each
gene. Medline records were indexed with a primary gene key
when any synonym for that key was found in the title or
abstract. Case-insensitive string mapping was used for all
searches except as noted above. No additional weight was
added for multiple occurrences of a term or the co-occurrence
of multiple synonyms for the same gene key.

Table 1. Systematic Sources of False Positives and False Negatives in Unfiltered Data a

source of error                         error type      example                                                                   filter solution
gene symbol/name is not unique          false positive  MAG: myelin associated glycoprotein; MAG: malignancy-associated protein   eliminate this term
gene symbol is unrelated abbreviation   false positive  PA: pallid homologue (mouse), pallidin (also abbrev. for Pennsylvania)    eliminate this term
gene symbol/name has language meaning   false positive  WAS: Wiskott-Aldrich Syndrome (also the word "was")                       case-sensitive string search
nonstandard syntax                      false negative  BAG-1 instead of BAG1                                                     add dash term
unofficial gene name/symbol             false negative  P53 instead of TP53                                                       add all gene nicknames
nonspecified gene name                  false negative  estrogen receptor instead of Estrogen receptor 1                          add family stem term

a In preliminary studies, Medline was searched for co-occurrence of genes and diseases and the resulting output was evaluated to identify error sources that
were amenable to global filters. Each error source is categorized by the type of error it causes: false positives are suggested relationships that are not real and
false negatives are real relationships that are underrepresented. The filter solutions used are indicated. Note that in some cases, the filter solution itself introduces
error. In general, error rates maximized sensitivity, even at the expense of specificity if needed.

Medline records were searched with all qualified gene 
identifiers, such as the official/preferred gene symbol, the 
official/preferred gene name, all gene nicknames and all syntax 
variants. In situations where there are several members of a 
gene family or splice variants, some authors prefer to use a 
shortened gene family name, e.g., estrogen receptor instead of
estrogen receptor 1 (ESR1), creating a source of false negatives.
For this reason, gene family stem terms were created for all
genes that have an alpha or numerical suffix (e.g., IL2RA, TGFβ1,
ESR1, etc.) and then used to search the literature. The family
stem terms were handled separately from the specific gene 
names so that it would be clear when linkages were made to 
the gene family versus a specific member in that family. 
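
One way to derive a family stem term from a full gene name is sketched below. The suffix pattern and the function names are assumptions made for illustration; the exact stemming rules used for MedGene are not given here.

```python
import re

# Hypothetical suffix rule: a trailing number (optionally with letters),
# a single letter, or a spelled-out Greek letter.
_SUFFIX = r"(?:[A-Za-z]?\d+[A-Za-z]?|[A-Za-z]|alpha|beta|gamma)"
_STEM_PATTERN = re.compile(r"^(.*\S)\s+" + _SUFFIX + r"$", re.IGNORECASE)


def family_stem(gene_name):
    """Return the name without its trailing suffix, or None if it has none."""
    match = _STEM_PATTERN.match(gene_name)
    return match.group(1) if match else None


def search_terms(gene_name):
    """Keep the specific name and the family stem under separate keys."""
    terms = {"gene": gene_name}
    stem = family_stem(gene_name)
    if stem is not None:
        terms["family"] = stem
    return terms
```

Here `family_stem("estrogen receptor 1")` returns `"estrogen receptor"`. Keeping the two keys apart reflects the point made above: a hit on the stem indicates a link to the gene family, not to a specific member.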

To improve performance and accuracy, some pre-selection 
was applied to the records that were scanned. First, review 
articles were eliminated to avoid redundant treatment of 
citations. Second, non-English journals were removed because
the natural language filters were only relevant to English
publications. Finally, journals unlikely to contain primary data
about gene-disease relationships were also removed (e.g., Int.
J. Health Educ., Bedside Nurse, and J. Health Econ.). Together,
these filters reduced the 12 198 221 Medline publications (July
2002) by 37%.

Ranking the Relative Strengths of Gene-Disease Associations.
In total, there were 618 708 gene-disease co-citations,
in which 16% (8297) of all studied genes had been associated
with a disease and 96% (3875) of all diseases had been associated
with at least one gene. To rank the relative strengths of
gene-disease relationships, we tested several different statistical
methods and examined the results. With the exception of the 
relative risk estimates, the methods provided similar results 
with respect to the rank order of the gene-disease association 
strengths. However, after comparing the results to other 
databases and after consulting disease experts, the log of the 
product of frequency (LPF) was selected for further analysis 
because it gave the best results overall. 
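
The LPF formula itself is not defined in this part of the text. The sketch below assumes it is the base-10 logarithm of the product of two proportions: co-citations over all citations of the disease, and co-citations over all citations of the gene. That definition, the function names, and the example counts are assumptions for illustration only.

```python
import math


def log_product_of_frequency(double_hits, disease_hits, gene_hits):
    """Assumed LPF: log10 of (double/disease) * (double/gene)."""
    if double_hits <= 0:
        return None  # no co-citation, so no association score
    product = (double_hits / disease_hits) * (double_hits / gene_hits)
    return math.log10(product)


def rank_genes(counts, disease_hits):
    """counts maps gene -> (double_hits, gene_hits); strongest first."""
    scored = []
    for gene, (double_hits, gene_hits) in counts.items():
        score = log_product_of_frequency(double_hits, disease_hits, gene_hits)
        if score is not None:
            scored.append((gene, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this definition a gene scores highly only when the co-citations make up a large share of both the disease literature and the gene's own literature, so a heavily cited gene mentioned a few times with the disease ranks low.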

Validation of MedGene. In developing this tool, it was
important to minimize the number of missed genes (false
negatives) and miscalled genes (false positives). However, in
situations when these goals were in conflict, inclusiveness was
prioritized. To determine the false negative rate in MedGene,
breast cancer was used as a test case because it was associated
with more genes than any other human disease and because
there were several public databases that link genes to breast
cancer. We compared the list of breast cancer-related genes
from MedGene to these databases, as illustrated in Figure 1.
Among the 285 distinct breast cancer-related genes that were
supported by at least one literature citation in these hand-
curated databases, 26 were absent from MedGene, suggesting
a false negative rate of approximately 9%. To determine why
these were missed, all literature references for these genes (80
papers) were reviewed manually (see the Supporting Information,
Supplemental Table 2, or visit
http://hipseq.med.harvard.edu/MedGene/publication/s_Table2.html).
Among these papers, most false negatives were caused by nonstandard
gene terms or gene terms eliminated by our specificity filters.
Few genes were missed because they were only mentioned in
review papers (0.4%) or they appeared only in the body of the
manuscript but not the abstract or title (1.1%). Of note,
MedGene identified approximately 2000 additional breast
cancer-related genes not listed in any other database.

Figure 1. Estimation of the false negative rate by comparison
with hand-curated databases. The breast cancer-related genes
identified by MedGene were compared with those listed in
several other databases including the Tumor Gene Database
(TGD), 2 the Breast Cancer Gene Database (BCG), 1 GeneCards
(GC), 17 and Swissprot. 18 Genes were considered false negatives
if they were represented in at least one of these other databases
and not in MedGene and their link to breast cancer was sup-
ported by at least one literature reference. All literature references
were verified by manual review to confirm their validity. The
number of genes in each database or shared by more than one
database is indicated. The false negative rate was calculated by
genes missed at MedGene (26)/total number of nonoverlapping
genes in other databases (285).
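
The false negative estimate above is a ratio of set sizes: genes found in at least one reference database but missing from MedGene, divided by the nonoverlapping union of the reference databases. A minimal sketch follows; the function name and the toy gene identifiers are hypothetical, and only the counts 26 and 285 come from the text.

```python
def false_negative_rate(reference_gene_sets, tool_genes):
    """Return (rate, missed) against the union of the reference sets."""
    reference = set().union(*reference_gene_sets)
    missed = reference - set(tool_genes)
    return len(missed) / len(reference), missed


# Toy example with made-up gene identifiers.
databases = [{"g1", "g2"}, {"g2", "g3"}, {"g4"}]
rate, missed = false_negative_rate(databases, {"g1", "g2", "g3", "g9"})

# The reported figures: 26 missed out of 285 nonoverlapping genes.
reported_rate = 26 / 285
```

Genes found only by the tool (such as `g9` in the toy data) do not enter the calculation, which is why the roughly 2000 additional MedGene genes have no effect on the false negative rate.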

To assess the false positive error rate, two complementary 
approaches were used: a detailed analysis of one disease and 
a global examination of 1000 diseases. The detailed approach 
examined the false positive error rate and its sources, whereas 
the global approach tested whether the overall results made 
biomedical sense. 

Using the LPF, 1467 genes related to prostate cancer were 
assembled in rank order. We then retrieved approximately 300 
Medline records each for the highest ranked 100 and the lowest 
ranked 200 genes and manually reviewed the titles and 
abstracts to determine the verity of the association. Nearly 80% 
of the highest ranked 100 genes fell into one of the five 
categories that reflect meaningful gene-disease relationships 
(see the Supporting Information, Supplemental Table 3, or visit
http://hipseq.med.harvard.edu/MedGene/publication/s_Table3.html).
Among the lowest ranked 200 genes, approximately 70%
reflected true relationships. Of the 600 records
reviewed, there were only two in which the association between
the gene and the disease was described as negative. Both were
genes with very low scores. In both cases, the authors did not
argue the absence of any relationship, but rather that a
particular feature of the gene or protein was not shown to be
related to human prostate cancer. 13,14

The coincidence of some gene symbols with medical abbreviations,
chemical abbreviations, and biological abbreviations
resulted in most of the false positives (see the Supporting
Information, Supplemental Table 4, or visit
http://hipseq.med.harvard.edu/MedGene/publication/s_Table4.html),
emphasizing the importance of the filters that were added in the
search algorithm (Table 1). Without the filters, the false positive
rate more than doubled, and the false negative rate rose
dramatically (data not shown). For example, among the papers
about breast cancer, there were only 12 Medline records that
referred to ESR1 and 10 to ESR2, whereas almost 2000 papers
mentioned estrogen receptor without specifying ESR1 or ESR2;
this latter group was detected by the family stem term filter. 

To further validate these results, a global analysis of the gene- 
disease relationships described by MedGene was performed. 
For this experiment, it was reasoned that the more closely 
related the diseases are to one another, the more they will be 
related to the same gene sets. Thus, if the relationships defined 
by MedGene accurately reflected the literature, then an
unsupervised hierarchical clustering of the gene data should group
diseases in a manner consistent with common medical thinking.
Conversely, if the clustered diseases do not make sense
biologically or medically, it may reflect excessive false positives,
false negatives, or inappropriate scoring of the data.

To execute this experiment, the gene sets and the corre- 
sponding LPF values for 1000 randomly selected diseases (each 
with at least 50 gene relationships) were used as a dataset for 
clustering the diseases. A review of the results showed that the 
resulting disease clusters were indeed logical based upon 
common medical knowledge (see the Supporting Information,
Supplemental Figure 1, or visit
http://hipseq.med.harvard.edu/MedGene/publication/s_Figure1.html).
For example, in one
such cluster shown in Figure 2, diabetes and its complications 
grouped together and were also closely linked to diseases 
associated with starvation states. 
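
The clustering experiment rests on the idea that related diseases share genes. The sketch below illustrates that idea with a Jaccard distance over gene sets and naive single-linkage merging. Both choices are assumptions: the distance metric and linkage used for MedGene are not stated here, this version ignores the LPF weights, and the disease and gene labels are made up.

```python
from itertools import combinations


def jaccard_distance(genes_a, genes_b):
    """1 minus (shared genes / all genes) for two gene sets."""
    union = genes_a | genes_b
    if not union:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(union)


def single_linkage(disease_genes, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in disease_genes]

    def cluster_distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(jaccard_distance(disease_genes[a], disease_genes[b])
                   for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical gene sets for four unnamed diseases.
toy = {
    "D1": {"g1", "g2", "g3"},
    "D2": {"g1", "g2", "g4"},
    "D3": {"g7", "g8", "g9"},
    "D4": {"g7", "g8"},
}
groups = single_linkage(toy, n_clusters=2)
```

In the toy data, D1 and D2 share two genes and D3 and D4 share two genes, so the two resulting groups follow the shared gene sets, which is the pattern the validation looks for at scale.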

The number of genes associated with a given disease can
be estimated by adjusting the MedGene number up by the false
negative rate (~9%) and down by the false positive rate (~26%
on average). Using this, the average disease has 103.7 ± 45.3
(mean ± s.d.) genes associated with it, although the range is
quite broad, with 2359 genes related to breast cancer, 2122
genes related to lung cancer, and no genes related to a number
of diseases.

Applying MedGene to the Analysis of Large Datasets. Access 
to a comprehensive summary of the genes linked to human 
diseases provided an opportunity to analyze data obtained from 
a high-throughput experiment. We compared the MedGene 
breast cancer gene list to a gene expression data set generated 
from a micro-array analysis comparing breast cancer and
normal breast tissue samples. Micro-array analysis identified
2286 genes that had greater than a 1-fold difference in mean
expression level between breast cancer samples and normal
breast samples. Using MedGene, we sorted the 2286 genes into
four classes: 555 genes directly linked to breast cancer in the
literature by gene term search (first-degree association by gene
name); 328 genes directly linked by family term search (first-
degree association by family term); 1021 genes linked to breast
cancer only through other breast cancer genes (second-degree
association); and 505 genes not previously associated with
breast cancer. (See the Supporting Information, Supplemental
Figure 2, or visit
http://hipseq.med.harvard.edu/MedGene/publication/s_Figure2.html.)
Among the 505 previously unrelated genes, 467 were either newly
identified genes or genes
that had not previously been associated with any disease. 
Among the remaining 38 genes, 9 had been related to other 
cancers, specifically esophageal, colon, uterine, skin, and cervix. 
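
The four-way sorting described above can be sketched as a precedence rule over sets. The function name, the `gene_neighbors` mapping (genes linked to each gene in the literature), and the toy identifiers are assumptions; how second-degree links were actually computed is not detailed here.

```python
def classify_gene(gene, direct_by_name, direct_by_family, gene_neighbors):
    """Assign one of four association classes, checked in priority order."""
    if gene in direct_by_name:
        return "first-degree (gene name)"
    if gene in direct_by_family:
        return "first-degree (family term)"
    linked = direct_by_name | direct_by_family
    # Second-degree: linked to the disease only through another linked gene.
    if gene_neighbors.get(gene, set()) & linked:
        return "second-degree"
    return "not previously associated"


# Hypothetical inputs for illustration.
by_name = {"g1"}
by_family = {"g2"}
neighbors = {"g3": {"g1"}, "g4": {"g5"}}
classes = {g: classify_gene(g, by_name, by_family, neighbors)
           for g in ["g1", "g2", "g3", "g4"]}
```

Checking the classes in this order guarantees that each gene lands in exactly one class, with the most direct evidence taking precedence.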

To determine whether the genes highlighted by the micro- 
array analysis were more likely to have been previously linked 
to breast cancer in the literature, we created a two-dimensional 
plot of the fold change of expression level between breast 
cancer and normal tissue versus the literature score (LPF) 
(Figure 3A). There was a broad spread of expression changes 
among the genes directly linked to breast cancer ranging from 
less than 1-fold change (68%) to over 40-fold (0.3%). Notably, 
the majority of genes with greater than 10-fold expression
changes were linked to breast cancer by first-degree association.

Among all 754 genes directly linked to breast cancer in the 
literature, there was no correlation between LPF and micro-array
fold change (r = 0.018, p-value = 0.62). However, when
we stratified the analysis based on the magnitude of the fold
change, we observed an increasing trend in correlation (Figure
3B), suggesting that genes with a more substantial change in
expression level were more likely to have a stronger association
in the literature. For genes that had 10-fold change or more in
expression level, the correlation increased to 0.41 (p-value =
0.05).
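
The stratified analysis can be sketched as a Spearman rank correlation (the statistic named in the Figure 3 caption) recomputed on genes at or above each fold-change threshold. Whether the original strata were cumulative thresholds or separate bins is not stated, so the cumulative form, the minimum subset size, and all names and toy values below are assumptions.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation, or None if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def stratified_spearman(genes, thresholds, min_genes=3):
    """genes: (abs_fold_change, lpf) pairs. Returns {threshold: rho}."""
    result = {}
    for threshold in thresholds:
        subset = [(f, s) for f, s in genes if f >= threshold]
        if len(subset) < min_genes:
            result[threshold] = None  # too few genes to correlate
        else:
            folds, scores = zip(*subset)
            result[threshold] = spearman(list(folds), list(scores))
    return result
```

A call such as `stratified_spearman(pairs, [3, 5, 10])` returns one coefficient per threshold, which is the quantity plotted against fold change in Figure 3B. The sketch does not compute p-values.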

When we evaluated the micro-array data separately for ER 
positive and ER negative tumors, the trend in correlation 
between fold change and literature score was highly dependent 
on estrogen receptor status. Interestingly, there was a similar 
trend in correlation for ER positive tumors, but no trend in 
correlation for ER negative tumors. 
[Figure 2(B) cluster labels: Diabetic Ketoacidosis; Diabetes Mellitus, Non-Insulin-Dependent; Diabetes Mellitus, Insulin-Dependent; Pregnancy in Diabetics; Diabetic Retinopathy; Hypoglycemia; Hyperglycemia; Diabetes Mellitus, Experimental; Diabetes Mellitus; Diabetes, Gestational; Brain Edema; Nutrition Disorders]

Figure 2. Global validation by clustering analysis. 2(A). The gene sets and the corresponding LPF values for 1000 diseases, each with
at least 50 gene relationships, were used in an unsupervised clustering of the diseases based on the gene patterns associated with
them. A sample of the data is shown here. 2(B). One of the resulting clusters is shown that corresponds to blood sugar states. Diabetes
terms (above the line) and starvation states terms (under the line) clustered together. Within these groups, there is also clustering of
diabetic small vessel complications, altered serum chemistries, nutritional disorders, etc. (Supplemental Figure 1:
http://hipseq.med.harvard.edu/MedGene/publication/s_Figure1.html).

Finally, to validate our findings, we computed similar
correlations between the breast cancer expression data and
LPF scores generated by MedGene for hypertension, a
disease unrelated to breast cancer. As expected, we did not
observe an increasing trend in correlation for hypertension.
Figure 3. Relationship between literature score and functional data for breast cancer. 3A. The data from an expression analysis of
samples for breast tumors and normal breast tissue were analyzed to indicate the fold difference of expression level between breast
tumor and normal sample (cutoff > 3-fold change). The fold changes were plotted against the literature score for the same gene set.
Green dots represent first-degree association by gene search, blue dots represent first-degree association by family search, and red
dots represent no association. Some well-studied genes, such as BRCA2 (pink circle), are not reflected by a substantial difference in
expression level. Furthermore, the majority of genes that have no association with breast cancer in the literature had less than 10-fold
expression changes (shaded area). 3B. The Spearman rank-correlation coefficients between literature score (LPF) and the fold change
of expression level between tumor and normal breast samples (y-axis) in relation to the amount of fold change of expression level
(x-axis). Gene rank lists were generated for breast cancer (blue) and hypertension (pink). Correlations were also computed between
the breast cancer gene LPF scores and fold change expression data among estrogen receptor positive tumors only (light blue) and
estrogen receptor negative tumors only (purple).
breast neoplasms    hypertension          rheumatoid arthritis      bipolar disorder      atherosclerosis

estrogen receptor   REN                   RA                        ERDA1                 apolipoprotein
PGR                 DBF                   TNFRSF10A                 SNAP29                APOE
ERBB2               LEP                   CRP                       PFKL                  LDLR
BRCA1               AGT                   AS                        DRD2                  ELN
BRCA2               FNS                   ESR1                      TRH                   ARG1
EGFR                kallikrein            HLA-DRB1                  IMPA2                 APOB
CYP19               ACE                   DR1                       HTR3A                 APOA1
TFF1                endothelin            interleukin               DRD3                  MSR1
PSEN2               S100A6                TNF                       REM                   LPL
TP53                BDK                   IL6                       KCNN3                 PON1
CES3                DIANPH                collagen                  DRD4                  plasminogen activator inhibitor
CEACAM5             SARI                  IL1A                      HTR2C                 PLG
ERBB3               PM                    ACR                       RELN                  vascular cell adhesion molecule
cyclin              CD59                  TNFRSF12                  DBH                   ATOM
COX5A               ALB                   IL2                       MAOA                  VWF
cathepsin           CYP11B2               CHI3L1                    COMT                  INS
ERBB4               MAT2B                 ILS                       HTR2A                 ARG2
TRAM                angiotensin receptor  interleukin 1             SYNJ1                 ABCA1
CCND1               AGTR2                 matrix metalloproteinase  INPP1                 OLR1
EGF                 NPPA                  interferon                NEDD4L                collagen
MUC1                LVM                   CD68                      FRA13C                MCP
insulin-like        DBH                   IL4                       transducer of ERBB2   lipoprotein
BCL2                NPY                   IL17                      BAIAP3                APOA2
mucin               POMC                  MMP3                      ATP1B3                intercellular adhesion molecule
FGF3                neuropeptide          SIL                       DRD5                  RAB27A

a MedGene results for the top 25 genes associated with breast neoplasms, hypertension, rheumatoid arthritis, bipolar disorder, and atherosclerosis, respectively,
ranked by LPF scores. The hyperlink to all the papers co-citing the gene and the disease is available at the MedGene website
(http://hipseq.med.harvard.edu/MedGene/).



Discussion 

The Human Genome Project heralded a new era in biological
research where the emphasis on understanding specific path-
ways has expanded to global studies of genomic organization
and biological systems. High-throughput technologies can
provide novel insight into comprehensive biological function
but also introduce new challenges. The utility of these
technologies is limited by the ability to generate, analyze, and
interpret large gene lists. MedGene, a relational database
derived by mining the information in Medline, was created to
address this need. MedGene users can query for a rank-ordered
list of human gene-disease relationships (Table 2) for one or 
more diseases. Each entry is hyperlinked to the original papers 
supporting each association and to other relevant databases. 

MedGene is an innovative extension of previous text mining 
approaches. Perez-Iratxeta et al. used the GO annotation and
their chromosomal locations to predict genes that may
contribute to inherited disorders. 8 MedGene takes a broader view
and includes all diseases and all possible gene-disease
relationships. Furthermore, MedGene utilizes co-citation to indicate a
relationship rather than GO annotation, which is limited to the
subset of genes that have GO annotation. Our approach is
complementary to that taken by Chaussabel and Sher, who 
used the frequency of co-cited terms to cluster genes into a 
hierarchy of gene-gene relationships. 6 

A unique aspect of this tool is the ability to assess the relative
strengths of gene-disease relationships based on the frequency
of both co-citation and single citation. This presupposes that
most co-citations describe a positive association, often referred
to as publication bias, 15 and is supported by our observations
that negative associations are rare (Supplemental Table 3;
http://hipseq.med.harvard.edu/MedGene/publication/s_Table3.html).
Of course, relationships established by frequency
of co-citation do not necessarily represent a true biological link;
however, frequent co-citation is strong evidence to support a true
relationship.

Another important feature of MedGene is the implementa- 
tion of software filters that substantially reduced the error rate. 
We estimate that less than 10% of all associations were missed 
and at least 70% of even the weakest associations were real. 
For this study, all of the filters that we applied were general 
ones, e.g., expanding the list of all gene names to address the 
different syntax forms used by different journals, eliminating 
gene names that correspond to common English words, etc. 
The majority of the remaining search term ambiguities were 
idiosyncratic and difficult to identify systematically without 
causing a significant rise in false negatives. Alternative
approaches, such as the examination of the nearest neighbor
terms, need to be considered to further reduce the false positive 
rate. 

It is not uncommon to see expression changes in micro-array
experiments as small as 2-fold reported in the literature.
Even when these expression changes are statistically significant,
it is not always clear if they are biologically meaningful. When
comparing expression levels of disease to normal tissue, one
expects an enrichment of known disease-related genes to
appear in the altered expression group. MedGene provided a
unique opportunity to test this notion in the context of existing
knowledge on a novel breast cancer micro-array dataset. For
genes displaying a 5-fold change or less in tumors compared
to normal, there was no evidence of a correlation between
altered gene expression and a known role in the disease. This
reflects the many genes whose role in breast cancer may not
involve large changes in expression in sporadic tumors (e.g.,
BRCA1 and BRCA2) and genes whose modest changes in
expression may be unrelated to the disease. Strikingly, among
genes with a 10-fold change or more in expression level, there
was a strong and significant correlation between expression
level and a published role in the disease, providing the first
global validation of the micro-array approach to identifying
disease-specific genes.

Table 3. Genes with Large Expression Changes in ER- but
Not in ER+ Breast Tumors

gene symbol   fold change (ER+)   fold change (ER-)
KRTHB1        1.0                 610.8
BRS3          1.2                 89.4
DKK1          1.2                 69.8
ZIC1          1.9                 59.6
TLR1          1.0                 38.5
KIAA0680      2.6                 33.2
CDKN3         1.0                 30.6
EBI2          4.0                 27.9
GZMB          3.8                 21.9
STK18         4.7                 18.6
GPR49         1.0                 14.6
MYO10         1.6                 14.4
LAD1          -1.0                13.5
POLE2         4.2                 13.0
HMG4          4.4                 12.9
BCL2L11       -1.2                12.3
LRP8          2.9                 12.2
CCNB2         1.0                 11.8
CCNE2         4.0                 11.6
FGB           -4.3                11.1
KNSL6         2.9                 10.9
H1F5          3.0                 10.2
SERPINH2      4.6                 10.2
YAP1          1.0                 10.0
ipim          -1.3                -10.4
TCEA2         -1.1                -10.8
TFF1          1.3                 -11.4
COL17A1       -4.1                -15.7
POPS          1.1                 -16.2
BPAG1         -4.6                -22.3
PDZK1         -1.1                -36.8
VEGFC         -2.8                -51.5
MUC6          -1.4                -64.9
SERPINA5      -1.0                -83.1
MEIS1         -1.6                -85.9
CA12          2.4                 -150.3

Table 3. MedGene identified a set of relatively understudied, yet highly
expressed genes in ER negative, but not ER positive breast tumors. All of
these genes have either never been co-cited with breast cancer or have a
weak association except those marked with an *.

The results derived from MedGene have two implications. 
First, a careful hunt for corroborating evidence of a role in 
breast cancer should precede any further study of genes with 
less than 5-fold expression level changes. Second, any genes
with 10-fold changes or more are likely to be related to breast
cancer and warrant attention. It is likely that this threshold will 
change depending on the disease as well as the experiment. 

Interestingly, the observed correlation was only found among 
ER-positive tumors, not ER-negative. This may reflect a bias
in the literature to study the more prevalent type of tumor in 
the population. Furthermore, this emphasizes that caution 
must be taken when interpreting experiments that may contain 
subpopulations that behave very differently. The MedGene 
approach identified a set of relatively understudied, yet highly 
expressed genes in ER-negative tumors that are worthy of
further examination (Table 3). 



In conclusion, we have developed an automated method of 
summarizing and organizing the vast biomedical literature. To
our knowledge, the resulting database is the most comprehensive
and accurate of its kind. By generating a score that reflects
the strength of the association, it provides an important tool 
for the rapid and flexible analysis of large datasets from various 
high-throughput screening experiments. Furthermore, it can 
be used for selecting subsets of genes for functional studies, 
for building disease-specific arrays, for looking at genes
common to multiple diseases, and various other high-throughput
applications. In the future, it will be possible to enhance the 
utility of the MedGene database by building links between 
genes and other MeSH terms as well as other biological 
processes and concepts, such as cell division and responses to 
small molecules. 